Flow based reply cache

ABSTRACT

A flow based reply cache of a storage system is illustratively organized into one or more microcaches, each having a plurality of reply cache entries. Each microcache is maintained by a protocol server executing on the storage system and is allocated on a per client basis. To that end, each client is identified by a client connection or logical “data flow” and is allocated its own microcache and associated entries, as needed. As a result, each microcache of the reply cache may be used to identify a logical stream of client requests associated with a data flow, as well as to isolate that client stream from other client streams and associated data flows used to deliver other requests served by the system. The use of microcaches thus provides a level of granularity that enables each client to have its own pool of reply cache entries that is not shared with other clients, thereby obviating starvation of entries allocated to the client in the reply cache.

FIELD OF THE INVENTION

The present invention relates to storage systems and, more specifically, to a reply cache used in a storage system.

BACKGROUND OF THE INVENTION

A storage system is a computer that provides storage services relating to the organization of information on writeable persistent storage devices, such as non-volatile memories and/or disks. The storage system typically includes a storage operating system that implements a file system to logically organize the information as a hierarchical structure of data containers, such as files and directories on, e.g., the disks. Each “on-disk” file may be implemented as a set of data structures, e.g., disk blocks, configured to store information, such as the actual data for the file. A directory, on the other hand, may be realized as a specially formatted file in which information about other files and directories is stored.

The storage system may be further configured to operate according to a client/server model of information delivery to thereby allow many clients to access files and directories stored on the system. In this model, the client may comprise an application executing on a computer that “connects” (i.e., via a client connection) to the storage system over a computer network, such as a point-to-point link, shared local area network, wide area network or virtual private network implemented over a public network, such as the Internet. Each client may request the services of the storage system by issuing file system protocol messages or requests, such as the conventional Network File System (NFS) protocol requests, to the system over the client connection identifying one or more files to be accessed. In response, the file system executing on the storage system services the request and returns a reply to the client.

Broadly stated, the client connection is provided by a process of a transport layer, such as the Transmission Control Protocol (TCP) layer, of a protocol stack residing in the client and storage system. The TCP layer processes establish the client (TCP) connection in accordance with a conventional “3-way handshake” arrangement involving the exchange of TCP message or segment data structures. The resulting TCP connection is a reliable, securable logical circuit that is generally identified by port numbers and Internet Protocol (IP) addresses of the client and storage system. The TCP protocol and establishment of a TCP connection are well-known and described in Computer Networks, 3rd Edition, particularly at pgs. 521-542.

Many versions of the NFS protocol utilize reply caches for their operation. A reply cache may serve many purposes, one of which is to prevent re-execution (replay) of non-idempotent operations by identifying duplicate requests. By caching reply information for such operations, replies to duplicate requests may be rendered from cached information, as opposed to re-executing the operation with the file system. For example, assume a client issues an NFS request to the storage system, wherein the request contains a non-idempotent operation, such as a rename operation that renames, e.g., file A to file B. Assume further that the file system receives and processes the request, but the reply to the request is lost or the connection to the client is broken. A reply is thus not returned to the client and, as a result, the client resends the request. The file system then attempts to process the rename request again but, since file A has already been renamed to file B, the system returns a failure, e.g., an error reply, to the client (even though the operation renaming file A to file B had been successfully completed). A reply cache attempts to prevent such failures by recording the fact that the particular request was successfully executed, so that if it were to be reissued for any reason, the same reply will be resent to the client (instead of re-executing the previously executed request, which could result in an inappropriate error reply).
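The rename scenario above can be sketched in a few lines of Python. This is only an illustrative toy, not the implementation described herein: the `FileSystem`, `Server` and `send_twice` names, and the use of a plain dictionary keyed by a request identifier, are assumptions made for the example.

```python
class FileSystem:
    """Toy file system holding only a set of file names (hypothetical)."""

    def __init__(self, names):
        self.names = set(names)

    def rename(self, old, new):
        # Non-idempotent: succeeds the first time, fails if repeated.
        if old not in self.names:
            return "ERROR: no such file"
        self.names.remove(old)
        self.names.add(new)
        return "OK"


class Server:
    """Toy protocol server with an optional reply cache (hypothetical)."""

    def __init__(self, fs, use_reply_cache):
        self.fs = fs
        self.use_reply_cache = use_reply_cache
        self.reply_cache = {}  # request id -> reply that was sent

    def handle_rename(self, request_id, old, new):
        if self.use_reply_cache and request_id in self.reply_cache:
            # Duplicate request: resend the recorded reply, do not re-execute.
            return self.reply_cache[request_id]
        reply = self.fs.rename(old, new)
        if self.use_reply_cache:
            self.reply_cache[request_id] = reply
        return reply


def send_twice(use_reply_cache):
    """The first reply is assumed lost, so the client resends the request."""
    server = Server(FileSystem({"A"}), use_reply_cache)
    first = server.handle_rename(7, "A", "B")   # reply lost in transit
    second = server.handle_rename(7, "A", "B")  # retransmission
    return first, second
```

Without the cache, the retransmission is re-executed and yields the inappropriate error reply; with the cache, the recorded reply is resent.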

Another purpose of the reply cache is to provide a performance improvement through work-avoidance by tracking “in-progress” requests. When using an unreliable transport protocol, such as the User Datagram Protocol (UDP), the client typically retransmits a subsequent NFS request if a response is not received from the storage system upon exceeding a threshold (e.g., one second) after transmission of an initial NFS request. For an NFS request containing an idempotent operation having a large reply, such as a read or readdir operation, the actual processing of the request by the file system could exceed this threshold for retransmission. Such in-progress requests are tracked so that any duplicate requests received by the system are discarded (“dropped”) instead of processing duplicate file operations contained in the requests. This work-avoidance technique provides a noticeable performance improvement for the NFS protocol over the UDP protocol.
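A minimal sketch of this work-avoidance idea follows, assuming requests are identified by a single integer. The `InProgressTracker` class and its method names are hypothetical, and the `executed` counter merely stands in for handing a request to the file system.

```python
class InProgressTracker:
    """Drops duplicates of requests that are still being processed."""

    def __init__(self):
        self.in_progress = set()  # identifiers of requests not yet answered
        self.executed = 0         # stands in for work done by the file system

    def receive(self, request_id):
        """Return 'drop' for a duplicate of an in-progress request."""
        if request_id in self.in_progress:
            return "drop"
        self.in_progress.add(request_id)
        self.executed += 1
        return "process"

    def complete(self, request_id):
        """Called once the reply has been sent; tracking ends here."""
        self.in_progress.discard(request_id)
```

In this sketch a retransmission that arrives while the original is still being served costs only a set lookup. Caching of completed replies is deliberately left out.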

A known implementation of an NFS reply cache is described in a paper titled Improving the Performance and Correctness of an NFS Server, by Chet Juszczak, Winter 1989 USENIX Conference Proceedings, USENIX Association, Berkeley, Calif., February 1989, pgs 53-63. Broadly stated, this implementation places reply cache entries into a “global least recently used (LRU)” data structure, i.e., a list ordered by a last modified time for each entry. In response to processing of a new NFS request from a client, a protocol server, e.g., an NFS server, executing on the storage system removes the oldest (thus, least recently used) entry from the list, clears its reply data and assigns the entry to the new request (thus invalidating the old cache entry). The reply cache implementation accords equal weight to all cached NFS replies and cache management is predicated on maintaining a complete record of the most recent replies in the reply cache using an LRU algorithm.

In general, clients utilizing the NFS protocol over the TCP protocol can retransmit NFS requests (if responses are not received from the storage system) a substantially long period of time after transmission of their initial requests. Such long retransmit times often result in active clients “starving” slower/retransmitting clients of entries in the reply cache, such that it is unlikely that a retransmitted duplicate non-idempotent request (in a deployment using NFS over TCP) will be found in a global LRU reply cache. The ensuing cache miss results in a replay of the non-idempotent operation and, potentially, data corruption.
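The starvation effect can be reproduced with a small model of a global LRU reply cache. The sketch below is an assumption-laden illustration, not the cited implementation: `GlobalLRUReplyCache`, `starvation_demo`, the client labels and the tiny capacity are all chosen for the example.

```python
from collections import OrderedDict


class GlobalLRUReplyCache:
    """One LRU list shared by every client (hypothetical model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # (client, xid) -> reply, oldest first

    def store(self, client, xid, reply):
        key = (client, xid)
        if key in self.entries:
            self.entries.move_to_end(key)
        elif len(self.entries) >= self.capacity:
            # Reuse the oldest entry regardless of which client owns it.
            self.entries.popitem(last=False)
        self.entries[key] = reply

    def lookup(self, client, xid):
        return self.entries.get((client, xid))


def starvation_demo(capacity=4, busy_requests=10):
    """A slow client caches one reply, then a busy client floods the cache."""
    cache = GlobalLRUReplyCache(capacity)
    cache.store("slow", 1, "rename OK")
    for xid in range(busy_requests):
        cache.store("busy", xid, "OK")
    # What the slow client finds when it finally retransmits request 1.
    return cache.lookup("slow", 1)
```

Once the busy client has issued at least as many requests as the cache holds, the slow client's entry is gone and its retransmission would be treated as a new request.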

SUMMARY OF THE INVENTION

The present invention overcomes the disadvantages of the prior art by providing a flow based reply cache of a storage system. The flow based reply cache is illustratively organized into one or more microcaches, each having a plurality of reply cache entries. Each microcache is maintained by a protocol server executing on the storage system and is allocated on a per client basis. To that end, each client is identified by a client connection or logical “data flow” and is allocated its own microcache and associated entries, as needed. As a result, each microcache of the reply cache may be used to identify a logical stream of client requests associated with a data flow, as well as to isolate that client stream from other client streams and associated data flows used to deliver other requests served by the system. The use of microcaches thus provides a level of granularity that enables each client to have its own pool of reply cache entries that is not shared with other clients, thereby obviating starvation of entries allocated to the client in the reply cache.

In an illustrative embodiment, each client creates a client connection (e.g., a TCP connection) with a protocol server (e.g., an NFS server) executing on the storage system to issue requests (e.g., NFS requests) of a logical stream to the server. In response to creating the connection or data flow associated with the client, the NFS server allocates a microcache to the data flow. The microcache is illustratively embodied as a “bin” having allocated entries or “buckets” into which are loaded replies associated with the requests of the stream issued by the client. The depth of the microcache illustratively comprises an estimated number of allocated entries that is managed using a predetermined policy.

The flow based reply cache illustratively includes a data structure, e.g., a flow look-up table, having a plurality of entries, each of which contains a reference (e.g., a logical data flow) to a microcache. Each logical data flow is represented by a flow data structure comprising two parts: (i) an identifier component used by the NFS server to identify a particular logical data flow, and (ii) a main body component containing the actual reply cache information and statistics for the data flow. At the core of each main body component is a microcache look-up table used to either locate free, available entries for the microcache within a least recently used list or identify in-progress entries within an in-progress request list allocated to each data flow.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and further advantages of the invention may be better understood by referring to the following description in conjunction with the accompanying drawings in which like reference numerals indicate identical or functionally similar elements:

FIG. 1 is a schematic block diagram of an environment including a storage system that may be advantageously used with the present invention;

FIG. 2 is a schematic block diagram of a storage operating system that may be advantageously used with the present invention;

FIG. 3 is a schematic block diagram illustrating a flow based reply cache according to the present invention; and

FIGS. 4A and 4B are flowcharts illustrating an operational procedure for the flow based reply cache according to the present invention.

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS

The present invention is directed to a flow based reply cache having an architecture that provides an improvement to the problematic global LRU reply cache implementation of the prior art. Instead of using a global LRU implementation, cache entries are stored on a per-client connection basis. Note that, as used herein, the term “client connection” denotes either a TCP connection, UDP packets grouped into the same logical network data flow or any other identifier, derived from underlying protocols, used to differentiate data sent from a client. A protocol server (e.g., an NFS server) executing on a storage system maintains a microcache for each client connection (or logical data flow) to store and retrieve recent replies for requests (e.g., NFS requests) issued by a client. Each microcache is dynamically managed and tuned for each data flow. By maintaining a microcache area for each client connection in a non-global manner, the NFS server can prevent an aggressive client from starving a comparatively inactive client of reply cache resources. The architecture of the flow based reply cache also allows the server to keep statistics for each client and use that information to tailor each cache to the needs of a client.

FIG. 1 is a schematic block diagram of an environment 100 including a storage system that may be advantageously used with the present invention. The storage system 120 is a computer that provides storage services relating to the organization of information on writable persistent storage devices, such as disks 130 of disk array 135. To that end, the storage system 120 comprises a processor 122, a memory 124, a network adapter 126, a storage adapter 128 and non-volatile memory 140 interconnected by a system bus 125. The storage system 120 also includes a storage operating system 200 that implements a virtualization system to logically organize the information as a hierarchical structure of data containers, such as files, directories and logical units, on the disks 130.

The memory 124 comprises storage locations that are addressable by the processor and adapters for storing software programs and data structures associated with the embodiments described herein. The processor and adapters may, in turn, comprise processing elements and/or logic circuitry configured to execute the software programs and manipulate the data structures. The storage operating system 200, portions of which are typically resident in memory and executed by the processing elements, functionally organizes the storage system by, inter alia, invoking storage operations in support of software processes executing on the system. It will be apparent to those skilled in the art that other processing and memory means, including various computer readable media, may be used to store and execute program instructions pertaining to the inventive technique described herein.

The non-volatile memory 140 comprises electronic storage illustratively embodied as a solid-state, non-volatile random access memory (NVRAM) array having either a back-up battery or other built-in last-state-retention capabilities (e.g., non-volatile semiconductor memory) that hold the last state of the memory in the event of any power loss to the array. As described herein, a portion of the non-volatile memory 140 is organized as temporary, yet persistent, non-volatile log storage (NVLOG 150) capable of maintaining information in light of a failure to the storage system.

The network adapter 126 comprises the mechanical, electrical and signaling circuitry needed to connect the storage system 120 to a client 110 over a computer network 160, which may comprise a point-to-point connection or a shared medium, such as a local area network. The client 110 may be a general-purpose computer configured to execute applications 112, such as a database application. Moreover, the client 110 may interact with the storage system 120 in accordance with a client/server model of information delivery. That is, the client may request the services of the storage system, and the system may return the results of the services requested by the client, by exchanging packets over the network 160. The clients may issue packets including file-based access protocols, such as the Common Internet File System (CIFS) protocol or NFS protocol, over TCP/IP when accessing information in the form of files. Alternatively, the client may issue packets including block-based access protocols, such as the Small Computer Systems Interface (SCSI) protocol encapsulated over TCP (iSCSI) and SCSI encapsulated over FC (FCP), when accessing information in the form of blocks.

The storage adapter 128 cooperates with the storage operating system 200 executing on the storage system to access information requested by the client. The information may be stored on the disks 130 of the disk array 135 or other similar media adapted to store information. The storage adapter includes input/output (I/O) interface circuitry that couples to the disks 130 over an I/O interconnect arrangement, such as a conventional high-performance, Fibre Channel serial link topology. The information is retrieved by the storage adapter and, if necessary, processed by the processor 122 (or the adapter 128) prior to being forwarded over the system bus 125 to the network adapter 126, where the information is formatted into a packet and returned to the client 110.

The disks 130 of the array are illustratively organized as one or more groups, wherein each group may be operated as a Redundant Array of Independent (or Inexpensive) Disks (RAID). Most RAID implementations enhance the reliability/integrity of data storage through the redundant writing of data “stripes” across a given number of physical disks in the RAID group, and the appropriate storing of parity information with respect to the striped data. An illustrative example of a RAID implementation is a RAID-4 level implementation, although it should be understood that other types and levels of RAID implementations, as well as other forms of redundancy, may be used in accordance with the inventive principles described herein.

FIG. 2 is a schematic block diagram of the storage operating system 200 that may be advantageously used with the present invention. In an illustrative embodiment described herein, the storage operating system is preferably the NetApp® Data ONTAP™ operating system available from Network Appliance, Inc., Sunnyvale, Calif. that implements a Write Anywhere File Layout (WAFL™) file system. However, it is expressly contemplated that any appropriate storage operating system may be enhanced for use in accordance with the inventive principles described herein. As such, where the term “WAFL” is employed, it should be taken broadly to refer to the file system component of any storage operating system that is otherwise adaptable to the teachings of this invention.

The storage operating system comprises a series of software layers, including a network driver layer (e.g., a media access layer 202, such as an Ethernet driver), network protocol layers (e.g., the IP layer 204 and its supporting transport mechanisms, the UDP layer 206 and the TCP layer 208), as well as a protocol server layer (e.g., an NFS server 212, a CIFS server 214, etc.) and a presentation layer configured to provide bindings for the transport mechanisms (e.g., an RPC layer 216) organized as a network protocol stack 210. In addition, the storage operating system 200 includes a disk storage layer 220 that implements a disk storage protocol, such as a RAID protocol, and a disk driver layer 230 that implements a disk access protocol such as, e.g., a SCSI protocol.

Bridging the disk software layers with the network and protocol server layers is a virtualization system that may be abstracted through the use of a database management system, a volume manager or, as described herein, a file system 240. The file system 240 illustratively provides logical volume management capabilities for use in access to the information stored on the storage devices, such as non-volatile memory 140 and disks 130. That is, in addition to providing file system semantics, the file system 240 provides functions normally associated with a volume manager. These functions include (i) aggregation of the disks, (ii) aggregation of storage bandwidth of the disks, and (iii) reliability guarantees, such as mirroring and/or parity (RAID).

The file system 240 illustratively implements the WAFL file system having an on-disk format representation that is block-based using, e.g., 4 kilobyte (kB) blocks and using index nodes (“inodes”) to identify files and file attributes (such as creation time, access permissions, size and block location). The file system uses files to store meta-data describing the layout of its file system; these meta-data files include, among others, an inode file. A file handle, i.e., an identifier that includes an inode number, is used to retrieve an inode from disk.

Operationally, a request from the client 110 is forwarded as one or more packets over the computer network 160 and onto the storage system 120 where it is received at the network adapter 126. A network driver of the protocol stack 210 processes the packet and, if appropriate, passes it on to a network protocol and protocol server layer for additional processing prior to forwarding to the file system 240. Here, the file system generates operations to load (retrieve) the requested data from disk if it is not resident “in core”, i.e., in the memory 124. If the information is not in the memory, the file system 240 indexes into the inode file using the inode number to access an appropriate entry and retrieve a logical volume block number (vbn). The file system then passes a message structure including the logical vbn to the disk storage layer 220; the logical vbn is mapped to a disk identifier and physical block number (disk,pbn) and sent to an appropriate driver (e.g., SCSI) of the disk driver layer 230. The disk driver accesses the pbn from the specified disk and loads the requested data block(s) in the memory 124 for processing by the storage system. Upon completion of the request, the storage system (and operating system) returns a reply to the client 110 over the network 160.

It should be noted that the software “path” through the storage operating system layers described above needed to perform data storage access for the client request received at the storage system may alternatively be implemented in hardware. That is, in an alternate embodiment of the invention, a storage access request data path may be implemented as logic circuitry embodied within a field programmable gate array (FPGA) or an application specific integrated circuit (ASIC). This type of hardware implementation increases the performance of the storage service provided by storage system 120 in response to a request issued by client 110. Moreover, in another alternate embodiment of the invention, the processing elements of adapters 126, 128 may be configured to offload some or all of the packet processing and storage access operations, respectively, from processor 122, to thereby increase the performance of the storage service provided by the system. It is expressly contemplated that the various processes, architectures and procedures described herein can be implemented in hardware, firmware or software.

As used herein, the term “storage operating system” generally refers to the computer-executable code operable on a computer to perform a storage function that manages data access and may, in the case of a storage system 120, implement data access semantics of a general purpose operating system. The storage operating system can also be implemented as a microkernel, an application program operating over a general-purpose operating system, such as UNIX® or Windows NT®, or as a general-purpose operating system with configurable functionality, which is configured for storage applications as described herein.

In addition, it will be understood by those skilled in the art that the invention described herein may apply to any type of special-purpose (e.g., file server, filer or storage serving appliance) or general-purpose computer, including a standalone computer or portion thereof, embodied as or including a storage system. Moreover, the teachings of this invention can be adapted to a variety of storage system architectures including, but not limited to, a network-attached storage environment, a storage area network and a disk assembly directly attached to a client or host computer. The term “storage system” should therefore be taken broadly to include such arrangements in addition to any subsystems configured to perform a storage function and associated with other equipment or systems.

Flow Based Reply Cache

As noted, the present invention is directed to a flow based reply cache of a storage system. The flow based reply cache is illustratively organized into one or more microcaches, each having a plurality of reply cache entries. Each microcache is maintained by a protocol server executing on the storage system and is allocated on a per client basis. To that end, each client is identified by its client connection or logical “data flow” (e.g., client (source) and storage system (destination) connection identifiers) and is allocated its own microcache and associated entries, as needed. As used herein, a connection identifier refers to a token which uniquely identifies a client's logical request stream. The token is derived from information describing the client's association with the server. This information includes, but is not limited to, explicit identifiers and/or transport identifiers, such as network (IP) addresses, network ports and transport protocols. As a result, each microcache of the reply cache may be used to identify a logical stream of client requests associated with a data flow, as well as to isolate that client stream from other client streams and associated data flows used to deliver other requests served by the system. The use of microcaches thus provides a level of granularity that enables each client to have its own pool of reply cache entries that is not shared with other clients, thereby obviating starvation of entries allocated to the client in the reply cache.

The flow based reply cache is illustratively implemented in memory 124 and has an in-core structure configured for use by the protocol server when cooperating with the file system 240. The file system, in turn, operates in an integrated manner with the use of non-volatile memory 140, a portion of which is organized as the NVLOG 150. Many requests executed (processed) by the file system 240 are recorded in the NVLOG, with each request being considered complete once the NVLOG record is marked complete. Execution of these requests generally requires some type of state change and, as such, the requests are considered non-idempotent requests including, e.g., rename requests.

As an example, assume the file system executes a client request (forwarded by the protocol server of the network protocol stack 210) to rename a file from A to B. Broadly stated, the file system 240 executes (processes) the request by, e.g., retrieving appropriate blocks of a directory from disk 130, loading the blocks into the memory 124 and changing (modifying) the blocks, including an appropriate block (entry) of the directory to reflect renaming of the file to B. The file system then marks the modified memory (e.g., buffer cache) blocks, including the directory entry block that now contains the name B for the file, as “dirty” so that they may be written to disk. At this point, the file system 240 does not write the dirty blocks to disk, but instead waits until execution of a consistency model event, e.g., a consistency point (CP), of the system.

Meanwhile, the file system creates a file system operation record of the request and stores the record in the NVLOG 150. Subsequently during the CP, the contents of the record are not written (flushed) to disk, but rather the processing results of those contents (as represented in the dirty buffer cache blocks) are flushed to disk. That is, only the dirty buffer cache blocks (and not the file system operation record) are written to disk. However, once the changes to be made to the file system are essentially reflected in the file system operation record and stored in the NVLOG, processing of the request is considered complete and the file system notifies the protocol server of such completion. The protocol server thereafter generates a reply containing information indicating, e.g., a successful completion of the request, and returns the reply to the client 110. In addition, the protocol server stores the reply in the reply cache so that it can reply to any duplicate requests without consulting the file system.

In an illustrative embodiment, each client 110 creates a client connection (e.g., a TCP connection) with the protocol server (e.g., NFS server 212) executing on the storage system 120 to issue requests (e.g., NFS requests) of a logical stream to the server. In response to creating the connection or data flow associated with the client, the NFS server allocates a microcache to the data flow. The microcache is illustratively embodied as a “bin” having allocated entries or “buckets” into which are loaded replies associated with the requests of the stream issued by the client. The depth of the microcache illustratively comprises an estimated number of allocated entries that is managed using a predetermined policy, such as a least recently used (LRU) algorithm.

The NFS server 212 illustratively allocates each flow the same size microcache, e.g., 100 entries, wherein each entry has a size sufficient to accommodate a request of the stream. In response to reception of a new request, the oldest unused entry in the microcache is discarded and used to accommodate/satisfy the new request. If all entries of the microcache allocated by the server are currently in progress (i.e., all 100 requests are currently being processed by the file system and have not yet been updated with replies) and another request associated with the flow is received at the server, the NFS server 212 may discard (“leak”) an entry from the cache, e.g., according to the LRU algorithm. However, such a situation may indicate that either the microcache is too small (the client is more active than estimated) or that there may be a problem (particularly if the server has replied to some of the requests) with, e.g., the network. Thus, by examining client activity on a per flow basis, the server can determine the behavior and needs of the client, e.g., whether the server is providing the necessary service/resources required by the client, which was absent in the prior art.
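The per-flow allocation and the “leak” case can be sketched as follows. This is a simplified model under stated assumptions: the `Microcache` and `FlowReplyCache` classes, the single ordered dictionary per flow, and the `leaks` counter are hypothetical and only illustrate the isolation and per-flow statistics described above.

```python
from collections import OrderedDict

MICROCACHE_SIZE = 100  # example depth taken from the text


class Microcache:
    """Fixed-size pool of reply cache entries for one data flow (sketch)."""

    def __init__(self, size=MICROCACHE_SIZE):
        self.size = size
        # request id -> reply; None means the request is still in progress.
        self.entries = OrderedDict()
        self.leaks = 0  # per-flow statistic: in-progress entries discarded

    def start(self, xid):
        """Claim an entry for a new request, reusing an old one if full."""
        if len(self.entries) >= self.size:
            # Prefer the oldest entry that already holds a reply.
            victim = next(
                (x for x, r in self.entries.items() if r is not None), None
            )
            if victim is None:
                # Every entry is in progress: leak the oldest one.
                victim = next(iter(self.entries))
                self.leaks += 1
            del self.entries[victim]
        self.entries[xid] = None

    def finish(self, xid, reply):
        """Record the reply; the entry becomes the most recently used."""
        if xid in self.entries:
            self.entries[xid] = reply
            self.entries.move_to_end(xid)

    def lookup(self, xid):
        """Return the cached reply, or None if absent or still in progress."""
        return self.entries.get(xid)


class FlowReplyCache:
    """Maps each data flow to its own microcache (sketch)."""

    def __init__(self, size=MICROCACHE_SIZE):
        self.size = size
        self.flows = {}

    def microcache(self, flow):
        if flow not in self.flows:
            self.flows[flow] = Microcache(self.size)
        return self.flows[flow]
```

Because eviction is confined to the flow's own pool, a busy flow can only displace its own entries. A nonzero `leaks` count on a flow is the kind of per-flow signal that could suggest the microcache is too small or that the network is misbehaving.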

FIG. 3 is a schematic block diagram illustrating the flow based reply cache 300 according to the present invention. The NFS server 212 illustratively sorts and manages entries of the reply cache 300 by logical data flow; accordingly, each logical data flow is associated with its own portion of the reply cache (i.e., a microcache) maintained by the NFS server. To that end, the flow based reply cache includes a data structure, e.g., a flow look-up table 310, having a plurality of entries 312, each of which contains a reference (e.g., a logical data flow) to a microcache 320. The flow look-up table 310 is illustratively embodied as a hash table, wherein client connection information is hashed to a specific entry 312. As a result, the contents of the look-up table 310 function as indexes used by the NFS server 212 to reference (point to) logical data flows using the client connection information. Note that the information involved with the lookup operation may be cached so that it can be accessed efficiently and accurately, and so that the appropriate bin (microcache) can be identified quickly.

In an illustrative embodiment, each logical data flow is represented in memory 124 by a flow data structure comprising two parts: (i) an identifier structure or component (hereinafter “FlowID 322”) used by the NFS server 212 to identify a particular logical data flow, and (ii) a main body component (hereinafter “FlowT 330” and its supporting data structures, as described herein) containing the actual reply cache information and statistics for the data flow. Each FlowID 322 is considered part of a FlowT 330 with which it is associated. As used herein, the terms “FlowID/FlowT pair” and “FlowT” may be used interchangeably. The FlowID/FlowT pair may continue to exist well beyond a connection (associated with the logical data flow) being closed by the client, e.g., the pairing may exist for the life of the NFS server 212.

Specifically, the FlowID is the structure pointed to (referenced) by the hashed entry 312 of the flow look-up table 310. The client connection information is stored in the FlowID 322 and illustratively includes (i) the client IP address, (ii) client port number, (iii) transport protocol and (iv) server IP address. Each microcache 320 of the reply cache 300 is thus identifiable using information stored in an associated FlowID 322. The FlowID 322 also contains all of the information needed for the NFS server to locate associated reply cache information in memory 124.

Once an entry 312 of the look-up table 310 is hashed, the server 212 searches through a hash chain 324 (linked list) of FlowIDs 322 referenced by that hashed entry for a matching logical data flow. Upon matching on a FlowID, the NFS server may access the actual data structures of the FlowT 330, e.g., in a 2-dimensional array fashion. At the core of each FlowT 330 is a microcache look-up table 340 used to either locate free, available entries 352 for the microcache within a LRU list 350 or identify in-progress entries 362 within an in-progress request list 360 allocated to each data flow. Note that the architecture of the flow based reply cache 300 provides two look-up (hash) tables because there are two levels of indirection and cache look-up operations. The first flow look-up table 310 is used to find the proper logical data flow or microcache 320, and the second microcache look-up table 340 is used to find an available entry 352 in the microcache.
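The first level of look-up, from client connection information to a FlowT by way of a hash chain of FlowIDs, might look roughly like the sketch below. The field names follow the four items listed for the FlowID, but the Python layout, the `FlowLookupTable` class, the bucket count and the use of the built-in `hash` are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import NamedTuple, Optional


class FlowID(NamedTuple):
    """Client connection information identifying one logical data flow."""
    client_ip: str
    client_port: int
    protocol: str
    server_ip: str


@dataclass
class FlowT:
    """Main body for a flow; the dict stands in for the microcache table."""
    flow_id: FlowID
    replies: dict = field(default_factory=dict)


class FlowLookupTable:
    """Hash table whose buckets are chains of flows (hypothetical sketch)."""

    def __init__(self, num_buckets=64):
        self.buckets = [[] for _ in range(num_buckets)]

    def _chain(self, flow_id):
        # Hash the connection information to a specific table entry.
        return self.buckets[hash(flow_id) % len(self.buckets)]

    def find(self, flow_id) -> Optional[FlowT]:
        # Walk the hash chain looking for a matching FlowID.
        for flow in self._chain(flow_id):
            if flow.flow_id == flow_id:
                return flow
        return None

    def find_or_create(self, flow_id) -> FlowT:
        flow = self.find(flow_id)
        if flow is None:
            flow = FlowT(flow_id)
            self._chain(flow_id).append(flow)
        return flow
```

The second level of look-up, inside the flow's own microcache, then operates only on the FlowT returned here.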

Illustratively, the NFS server 212 uses different portions of information to perform the reply cache look-up operations. A first portion (e.g., TCP and IP layer information) pertains to the client connection and is used to perform a flow look-up operation to the flow look-up table 310. Thereafter, a second portion (e.g., RPC layer information) is used to discern duplicate requests, i.e., RPC information matching is used to determine if there is an entry in the reply cache (a duplicate request). The RPC information illustratively includes (i) a transaction identifier (XID) of the request, (ii) a version number of the RPC program (PROGNUM) executed on the storage system, and (iii) an RPC procedure number (PROC) of the action to be taken by the program.

For example, the microcache look-up table 340 contains entries 342 (e.g., indexes) that point to each reply cache entry, illustratively indexed using a hash based on the XID and matching based on XID, PROC, PROGNUM and a checksum. (Note that if the XID matches to a unique entry, the comparison stops. But if such matching does not allow identification to a unique entry, then matching based on the additional RPC information is needed.) The XID is chosen by the client to collate requests and replies. The XID is illustratively embodied as a value that increments with each client request and is seeded in a pseudo-random manner at boot time. The XID may be initialized to a predetermined value by the client; however, the XID is illustratively initialized as a pseudo-random number each time the client boots.
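The matching rule noted above (index by a hash of the XID, stop when the XID identifies a unique entry, otherwise compare the additional RPC information) may be sketched as follows. The table size, the modulo hash and the names Entry, insert and find_entry are assumptions made for this sketch.

```python
from collections import namedtuple

# One reply cache entry as seen by the microcache look-up table.
Entry = namedtuple("Entry", "xid proc prognum checksum reply")

TABLE_SIZE = 16  # assumed; the text does not give a table size


def bucket_of(xid):
    """Hash of the XID used to index the microcache look-up table."""
    return xid % TABLE_SIZE


def insert(table, entry):
    table.setdefault(bucket_of(entry.xid), []).append(entry)


def find_entry(table, xid, proc, prognum, checksum):
    """Return the matching entry, or None if the request is not cached."""
    candidates = [e for e in table.get(bucket_of(xid), []) if e.xid == xid]
    if len(candidates) == 1:
        return candidates[0]  # the XID matched a unique entry; stop here
    for e in candidates:
        # Ambiguous XID: fall back to the additional RPC information.
        if (e.proc, e.prognum, e.checksum) == (proc, prognum, checksum):
            return e
    return None
```

Entries whose XIDs differ but hash to the same index are filtered out by the XID comparison, so only a repeated XID ever triggers the fuller comparison.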

In an illustrative embodiment, each microcache 320 has a fixed number of reply cache entries specified by a system-defined constant, e.g., 100 entries as initially allocated to each newly created flow. When residing in the cache, a reply cache entry is on one of two lists, the LRU list 350 or the in-progress list 360. An entry on the in-progress list 360 denotes that a request has been received and is currently being processed by the file system 240, i.e., the NFS server 212 is waiting for the reply data to be loaded into the entry. Illustratively, the entry is marked as being in existence (assigned to a request) but not having a result of the processing and, thus, no reply has been sent to the client.

The LRU list 350 is illustratively an “age ordered” list (e.g., a doubly-linked list) that has a number of possible links (one to a next LRU entry that is younger, another to a previous LRU entry that is older and another to the end or tail of the LRU list, e.g., if the entry is the oldest). The LRU list is a property of the data flow, i.e., the flow maintains information about the head and tail of its LRU list. If a new entry is needed for a new request of the flow, the LRU list is consulted to find the oldest entry on the list and that entry is assigned to the request. The LRU list is provided on a per flow (per bin) basis, i.e., one LRU list per flow.
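The age ordering of a per flow LRU list may be illustrated with a double-ended queue standing in for the doubly-linked list. The class and method names below are hypothetical; only the behavior (youngest at the head, oldest taken from the tail) comes from the text.

```python
from collections import deque


class LruList:
    """Age ordered list for one flow: head is youngest, tail is oldest."""

    def __init__(self, entries=()):
        # Leftmost element is the head; rightmost element is the tail.
        self._items = deque(entries)

    def take_oldest(self):
        """Remove and return the entry at the tail for reuse by a new request."""
        return self._items.pop()

    def add_youngest(self, entry):
        """Place a just-used entry at the head of the list."""
        self._items.appendleft(entry)

    def __len__(self):
        return len(self._items)
```

Since each flow owns its own list, the age of an entry is measured only against other requests of the same client.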

As noted, each entry 342 of the microcache look-up table 340 is an index that points to a reply cache entry; however, that entry may assume different states. For example, a reply cache entry may be assigned a request that has been processed by the file system 240 and updated with reply data. Accordingly, the entry holds valid reply data within the reply cache and thus assumes a valid state as represented by entry 352 of the microcache LRU list 350. (Note that, depending on the amount of time it has been residing in the cache, the entry 352 may thereafter assume an available state.) In contrast, the reply cache entry may be in an in-progress state and thus represented by an entry 362 of in-progress list 360. When a new cacheable request arrives from a client 110, the NFS server 212 removes a cache entry 352 associated with that client connection from the end of the LRU list 350, updates that entry with information pertaining to the request and inserts that entry on the in-progress list 360.

In an illustrative embodiment, the server maintains a “high watermark” of in-progress entries 362 in each microcache 320 to thereby provide an indication of the depth of that client's microcache. The high watermark is illustratively less than the full estimated depth (e.g., one third) and provides an indication to the server 212 of the number of outstanding requests from that client 110. For example, if the high watermark reaches a certain number of in-progress entries and does not increase from that mark, then the server has a sufficient indication that this may be the maximum number of requests that the client will have outstanding at one time. In addition, if the server 212 has received a request and has not sent a reply, the server knows that the request is outstanding. If the client has more outstanding requests than entries allocated in its microcache, then the server may allocate additional entries for the microcache 320. Illustratively, the server may grow (increase) the microcache by a predetermined amount, e.g., one third.
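The watermark bookkeeping may be sketched as follows. The growth amount of one third is taken from the text; the class name FlowStats, the method names, and the exact trigger (growing at the moment the count of outstanding requests exceeds the allocated entries) are assumptions made for this sketch.

```python
class FlowStats:
    """Per-flow counters for outstanding requests (hypothetical names)."""

    def __init__(self, allocated=100):
        self.allocated = allocated   # entries currently in the microcache
        self.in_progress = 0         # requests received but not yet replied to
        self.high_watermark = 0      # most requests ever outstanding at once

    def request_started(self):
        self.in_progress += 1
        self.high_watermark = max(self.high_watermark, self.in_progress)
        if self.in_progress > self.allocated:
            # Grow by one third of the current size, and by at least one entry.
            self.allocated += max(1, self.allocated // 3)

    def request_finished(self):
        self.in_progress -= 1
```

With the illustrative initial allocation of 100 entries, the 101st outstanding request would grow the microcache to 133 entries under these assumptions.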

When the request is completed, the entry is transferred from the in-progress list 360 to the beginning or head of the LRU list 350 (because it is the most recently used) and the server 212 populates the entry's associated protocol reply structure with any information needed to formulate a response or reply to the original NFS request. In other words, once the file system has finished processing the request, the NFS server returns a reply to the client 110. The reply cache entry then transitions from the in-progress state to a valid state, the reply data is loaded into the entry and the entry is inserted onto the LRU list 350.
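The state transitions of a reply cache entry may be illustrated as a small state machine. The three states are named in the text; the classes State, Entry and Microcache and the methods begin and complete are hypothetical, and in-progress entries are keyed by XID alone for brevity.

```python
from collections import deque
from enum import Enum


class State(Enum):
    AVAILABLE = "available"
    IN_PROGRESS = "in-progress"
    VALID = "valid"


class Entry:
    def __init__(self):
        self.state = State.AVAILABLE
        self.xid = None
        self.reply = None


class Microcache:
    def __init__(self, size):
        self.lru = deque(Entry() for _ in range(size))  # index 0 is the head
        self.in_progress = {}                           # XID -> entry

    def begin(self, xid):
        """Take the entry at the tail of the LRU list and mark it in progress."""
        entry = self.lru.pop()
        entry.state, entry.xid, entry.reply = State.IN_PROGRESS, xid, None
        self.in_progress[xid] = entry
        return entry

    def complete(self, xid, reply):
        """Load the reply and move the entry to the head of the LRU list."""
        entry = self.in_progress.pop(xid)
        entry.state, entry.reply = State.VALID, reply
        self.lru.appendleft(entry)
        return entry
```

An entry therefore leaves the LRU list only while its request is being processed, and re-enters at the head as the most recently used entry.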

Operation of Flow Based Reply Cache

FIGS. 4A and 4B are flowcharts illustrating an operational procedure for the flow based reply cache according to the present invention. In particular, the procedure is directed to a cache insertion operation using the flow based reply cache architecture and structures described herein. The procedure 400 starts at Step 402 and proceeds to Step 404 where the protocol server (e.g., NFS server 212) receives a new, non-idempotent request (e.g., an NFSv3 request) from a client which does not yet have any entries in the reply cache 300. In Step 406 the NFS server computes a hash value for the request, e.g., based on the client's connection information, and in Step 408, selects an appropriate flow look-up table 310 using the hash value computed from a conventional hash function. As noted, the hash value for the client connection is illustratively based on the client's IP address and port number, as well as transport protocol (TCP or UDP) used and the server's IP address.

In Step 410, the server searches the hash chain 324 for a FlowID 322 that matches the client's connection information. Here, the hash value may be matched to an entry 312 of the table 310 and the matching entry point to (reference) the hash chain 324 of FlowIDs 322. If there is no match in Step 412, e.g., there is no reference to a FlowID/FlowT pair for the client in the flow look-up table 310, the NFS server obtains a new FlowID/FlowT pair structure (i.e., microcache 320) from, e.g., a free list (not shown) in Step 414 and, in Step 416, inserts the structure into the FlowID hash chain 324. Note that if memory structures of the free list are needed, the server may issue a memory allocation (malloc) request to a memory manager of the storage system. Note also that entries (structures) 352 removed from the LRU list 350 are loaded into the free list, thereby effectively placing them back in an available pool.
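Steps 410 through 416 may be sketched as a look-up that falls back to a free list and then to fresh allocation. This is a sketch under stated assumptions: FlowPair, FlowAllocator and lookup_or_create are hypothetical names, the allocations counter merely stands in for a request to the memory manager, and Python's built-in hash is used in place of the unspecified hash function.

```python
class FlowPair:
    """Stand-in for a FlowID/FlowT pair structure."""

    def __init__(self):
        self.flow_id = None


class FlowAllocator:
    def __init__(self, preallocated=2):
        self.free_list = [FlowPair() for _ in range(preallocated)]
        self.allocations = 0  # counts fall-backs to the memory manager

    def obtain(self):
        if self.free_list:
            return self.free_list.pop()
        self.allocations += 1  # free list empty: allocate a new structure
        return FlowPair()


def lookup_or_create(chains, allocator, flow_id):
    """Return (pair, created) for the flow, creating the pair on a miss."""
    chain = chains[hash(flow_id) % len(chains)]
    for pair in chain:  # search the hash chain for a matching flow
        if pair.flow_id == flow_id:
            return pair, False
    pair = allocator.obtain()  # no match: obtain a new structure
    pair.flow_id = flow_id
    chain.append(pair)  # insert the structure into the hash chain
    return pair, True
```

A second request on the same connection finds the existing pair, so a client is charged for a new microcache only once per flow.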

Otherwise, in Step 418, the NFS server searches for a matching entry in the microcache look-up table 340 of the microcache 320 of the reply cache 300. That is, once a microcache (or bin) has been located, the server performs a look-up operation for the request (or bucket) in the microcache look-up table 340. If there is a match in Step 420, i.e., the new request is a retransmission of a previous operation request, a reply for that previous request may be in the reply cache and the procedure jumps to Step 430. Otherwise, the microcache is empty and therefore the server does not find a match. Accordingly, in Step 422, the server 212 removes the first cache entry 352 from the LRU list 350 and, in Step 424, inserts that entry as entry 362 on the in-progress list 360. Note that the server 212 moves the entry to the LRU list only when the request has been completed and the reply sent back to the client. Note also that if all of the entry structures on the LRU list 350 are populated, the server 212 retrieves the least recently used entry from the list.

In Step 426, the NFS server populates the cache entry 362 with, e.g., information from the RPC request and, in Step 428, inserts a reference (entry 342) to the populated entry into the microcache look-up table 340. Specifically, the cache entry is populated and inserted into the microcache look-up table, so that the entry 342 can be indexed based on the hash of the XID (unique label of the bucket). Note that the cache entry 342 is not inserted into the look-up table 340 in any particular order. Illustratively, the microcache look-up entry 342 remains valid until the reply cache entry 352 is removed from the LRU list 350 and recycled for a new request.

In Step 430, the NFS server receives an I/O completion from the file system and, in Step 432, the NFS server 212 locates the logical data flow (microcache) relating to the operation request, e.g., in the flow look-up table 310. Here, the file system 240 has finished processing the request and generated a reply to the request. The reply cache entry 362 remains on the in-progress list 360 until the file system returns the I/O reply message indicating that (processing of) the request is complete. When receiving such a message, the NFS server associates the request with the connection on which the request originated. The server 212 then uses the client connection information stored in the I/O message to locate the logical data flow corresponding to that connection. In particular, the message is received from the file system (instead of the network) and the two-step look-up operation previously described is performed to find the FlowID, FlowT and the entry.

In Step 434, the NFS server searches for the in-progress entry 362 using, e.g., the XID hash. Illustratively, a cacheable request should always be on the in-progress list 360 when the I/O reply message arrives from the file system 240. Upon locating the entry 362, the server 212 moves the entry from the in-progress list 360 to the LRU list 350 in Step 436. Note that the request's corresponding entry 342 in the microcache look-up table 340 remains unchanged. In Step 438, the NFS server stores reply (and protocol) data in the cache entry (now represented as entry 352) and the procedure ends at Step 440. At this point, the completed NFS request is considered entered into the microcache 320 of the reply cache 300. Reply data is loaded into the cache entry 352 so that the server can reply to any duplicate requests without consulting the file system.
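The procedure as a whole may be sketched end to end as follows. The sketch is deliberately simplified and rests on several assumptions: requests are matched on XID alone rather than on the full RPC information, a retransmission that arrives while the original is still in progress is simply dropped (the text does not say how that case is handled), and the names ReplyCache, on_request and on_completion are hypothetical.

```python
from collections import OrderedDict


class Microcache:
    def __init__(self, size):
        self.size = size
        self.in_progress = set()   # XIDs awaiting a file system reply
        self.lru = OrderedDict()   # XID -> reply; first item is the oldest


class ReplyCache:
    def __init__(self, entries_per_flow=100):
        self.entries_per_flow = entries_per_flow
        self.flows = {}            # flow identifier -> Microcache

    def on_request(self, flow_id, xid):
        """Return (action, reply) for a non-idempotent request."""
        mc = self.flows.get(flow_id)
        if mc is None:
            mc = self.flows[flow_id] = Microcache(self.entries_per_flow)
        if xid in mc.lru:
            return ("replay", mc.lru[xid])   # duplicate: answer from cache
        if xid in mc.in_progress:
            return ("drop", None)            # assumed handling, see lead-in
        if mc.lru and len(mc.lru) + len(mc.in_progress) >= mc.size:
            mc.lru.popitem(last=False)       # recycle the oldest entry
        mc.in_progress.add(xid)
        return ("execute", None)

    def on_completion(self, flow_id, xid, reply):
        """Move the entry from in-progress to the LRU list with its reply."""
        mc = self.flows[flow_id]
        mc.in_progress.remove(xid)
        mc.lru[xid] = reply
```

Because each flow has its own microcache, a burst of requests on one connection can recycle only that connection's entries; the same XID arriving on a different flow is treated as a new request.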

Advantageously, the flow based reply cache obviates potentially harmful duplication of processing by the storage system, while also addressing many issues associated with the use of a global LRU design. By separating reply cache data based on client connection information, issues related to entry starvation for retransmitting clients can be avoided. Moreover, by organizing reply cache entries based on logical data flows, each microcache can be managed independently, and cache entry expiration becomes a factor of client request rate, rather than overall system request rate.

While there have been shown and described illustrative embodiments for a flow based reply cache, it is to be understood that various other adaptations and modifications may be made within the spirit and scope of the present invention. For example, in response to receiving a message from the file system 240 that it has finished processing a request and generated a reply to the request, the protocol server (e.g., NFS server 212) may maintain a pointer to the appropriate entry 362 on the in-progress list 360 in order to locate the logical data flow corresponding to client connection information contained in the message. That is, in an alternative embodiment of the invention, a performance optimization may be realized by the server 212 maintaining a pointer rather than performing the two-step look-up operation described herein to find the FlowID, FlowT and the entry.
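The pointer-based alternative may be illustrated by having the in-flight request carry a direct reference to its entry, which in turn references its flow. This is one possible reading of the optimization; the classes Flow and Entry and the functions start and complete are hypothetical.

```python
class Flow:
    def __init__(self):
        self.in_progress = set()  # entries awaiting a file system reply
        self.lru = []             # index 0 is the head (most recently used)


class Entry:
    def __init__(self, flow):
        self.flow = flow          # back-reference to the owning flow
        self.reply = None


def start(flow):
    """Place a new entry on the in-progress list and return it as a handle."""
    entry = Entry(flow)
    flow.in_progress.add(entry)
    return entry


def complete(entry, reply):
    """Finish a request using only the handle; no hash look-ups are needed."""
    flow = entry.flow
    flow.in_progress.remove(entry)
    entry.reply = reply
    flow.lru.insert(0, entry)
```

The handle returned by start would travel with the request to the file system and come back with the completion message, replacing both the flow look-up and the microcache look-up.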

The foregoing description has been directed to specific embodiments of this invention. It will be apparent, however, that other variations and modifications may be made to the described embodiments, with the attainment of some or all of their advantages. For instance, it is expressly contemplated that the components and/or structures described herein can be implemented as software, including a computer-readable medium having program instructions executing on a computer, hardware, firmware, or a combination thereof. Accordingly, this description is to be taken only by way of example and not to otherwise limit the scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.

What is claimed is:
 1. A method for providing a reply cache of a storage system, comprising: organizing the reply cache into one or more microcaches, each microcache having one or more reply cache entries; allocating a microcache of the one or more microcaches to a client of the storage system, wherein the client has its own pool of reply cache entries; in response to a non-idempotent request from the client, replying to the non-idempotent request using an entry in the pool of reply cache entries, wherein the non-idempotent request is prevented from executing on the storage system, wherein the pool of reply cache entries includes an in-progress list and a least recently used (LRU) list, wherein a completed entry on the in-progress list is transferred to the LRU list, wherein the entry is complete when it is processed by the storage system; and increasing a size of the microcache allocated to the client when a number of entries on the in-progress list exceeds a threshold.
 2. The method of claim 1 wherein allocating the microcache to the client comprises identifying the client by a logical data flow.
 3. The method of claim 2 wherein the logical data flow comprises a connection identifier identifying the client and a connection identifier identifying the storage system.
 4. The method of claim 2 further comprising: matching the client request to a logical data flow identifier, wherein the logical data flow identifier has a client connection identifier.
 5. The method of claim 1 further comprising: loading a reply associated with a previous client request into an entry of the client microcache.
 6. The method of claim 4 further comprising: matching the client request against the pool of reply cache entries in the microcache.
 7. The method of claim 1 wherein the client request is via a protocol selected from a group of protocols consisting of CIFS, NFS, iSCSI, and SCSI over Fibre Channel.
 8. The method of claim 1, wherein the client request is identified using a transaction identifier.
 9. The method of claim 1 further comprising: increasing the size of the client microcache.
 10. The method of claim 4 wherein matching the client request to a logical data flow identifier comprising: performing a hash of the client request.
 11. The method of claim 8, further comprising: computing a hash of the transaction identifier of the client request.
 12. The method of claim 8, wherein the transaction identifier is a data structure including one or more integers.
 13. A system configured to provide a reply cache of a storage system, comprising: a protocol server configured to execute on the storage system, the protocol server configured to maintain one or more microcaches of the reply cache, each microcache comprising one or more reply cache entries, the protocol server further configured to allocate each microcache to a client of the storage system, wherein each client has its own pool of reply cache entries, the protocol server further configured to respond to a non-idempotent request from a client using an entry in the pool of reply cache entries, wherein the protocol server prevents the non-idempotent request from executing on the storage system, wherein the pool of reply cache entries includes an in-progress list and a least recently used (LRU) list, wherein a completed entry on the in-progress list is transferred to the LRU list, wherein the entry is complete when it is processed by the storage system, and the protocol server further configured to increase a size of the microcache allocated to the client when a number of entries on the in-progress list exceeds a threshold.
 14. The system of claim 13 wherein the protocol server comprises a protocol server selected from a group of protocol servers consisting of CIFS server, NFS server, iSCSI server, and SCSI over Fibre Channel server.
 15. The system of claim 13 further comprising a flow look-up table having one or more entries, wherein each entry includes a reference to a first microcache of the one or more microcaches.
 16. The system of claim 15 wherein the reference comprises a logical data flow allocated to the first microcache.
 17. The system of claim 16 wherein the logical data flow is represented by a flow data structure comprising (i) an identifier component identifying the logical data flow, and (ii) a main body component having reply cache information for the logical data flow.
 18. The system of claim 17 wherein the main body component comprises a microcache look-up table having the LRU list, and the in-progress list, wherein the in-progress list identifies in-progress entries allocated to the logical data flow, and wherein the LRU list identifies complete entries allocated to the logical data flow.
 19. The system of claim 13 wherein the LRU list comprises an age ordered list.
 20. The system of claim 18 wherein an in-progress entry on the in-progress list denotes that a request has been received from the client and is being processed by a file system executing on the storage system.
 21. The system of claim 20 wherein the protocol server is further configured to mark the in-progress entry as being assigned to the request, but that no reply has been sent to the client.
 22. The system of claim 13, wherein the protocol server is further configured to: identify the client request using a transaction identifier.
 23. The system of claim 13, wherein the protocol server is further configured to: increase the size of the client microcache.
 24. The system of claim 22, wherein the transaction identifier is a data structure including one or more integers.
 25. A computer readable medium containing executable program instructions for providing a flow based reply cache of a storage system, the executable instructions comprising one or more program instructions for: organizing the flow based reply cache into one or more microcaches, each microcache comprising a plurality of reply cache entries; allocating each microcache to a client of the storage system, wherein the client has its own pool of reply cache entries; in response to a non-idempotent request from the client, replying to the non-idempotent request using an entry in the pool of reply cache entries, wherein the non-idempotent request is prevented from executing on the storage system, wherein the pool of reply cache entries includes an in-progress list and a least recently used (LRU) list, wherein a completed entry on the in-progress list is transferred to the LRU list, wherein the entry is complete when it is processed by the storage system; and increasing a size of the microcache allocated to the client when a number of entries on the in-progress list exceeds a threshold.
 26. The computer readable medium of claim 25 wherein the program instruction for allocating each microcache to the client comprises one or more program instructions for identifying each client by a logical data flow.
 27. The computer readable medium of claim 26 further comprising one or more program instructions for: using at least one microcache of the one or more microcaches to identify a stream of client requests associated with the logical data flow; and using the at least one microcache to isolate the client stream from other client streams and associated logical data flows used to deliver other requests served by the storage system.
 28. The computer readable medium of claim 27 further comprising one or more program instructions for: loading replies associated with the client requests into the entries of the at least one microcache.